Newspaper Archive API

Articles endpoint

The Articles endpoint enriches the “page-centric” data available in the Pages endpoint with the text of the articles printed on those pages. You can fetch article data together with coordinates (which let you visualise the position of each article on the page) and as HTML markup, which lets you present the text in a more customisable, more “reader-friendly” way.

This endpoint also allows you to tap into a stream of all published articles and fetch them as soon as they are published.

Currently, articles are available for all products of Norwegian brands, and only for products published after 1 February 2025. Swedish brands will become available later.

The article endpoint is still under development, so parameters and values are subject to minor changes.

Data structure

Structure of data in the archive and description of concepts used in the documentation:

  • Product is the main element of data stored in the archive. An example of a product is "SvD Perfect Guide". For a given issue of a product on a given day, consumers see it as a list of pages from 1 to N.
  • Product code identifies a type of product, for example SVPG ("SvD Perfect Guide"), SVNY ("SvD Main newspaper") or SARB ("Stavanger Aftenblad editorial appendix"). The full list of product codes supported by this API is available on the page List of available products.
  • ProductIssueId - because there may be multiple issues of a product with one product code on one day, productIssueId can be used to identify one specific issue of a product on a given day. For example, there may be multiple advertisement appendixes on a given day, all with the same product code (for example APAB), but each with a different productIssueId.
  • PageNumber - page number within a product. For a given product issue on a given day it always increases from 1 to N. It generally corresponds to the page number printed on the page, but for older material (before 2010) there may be some exceptions to this rule.
  • Articles - The article with its content and metadata.

Search functionality

Article search request

The main way to search the archive for articles is to use the endpoint:

GET /paper-archive/api/v2/article/search

Example request:

curl "https://api.schibsted.com/paper-archive/api/v2/article/search?q=SPECIALISTREKRYTERING&newspaperBrand=SVENSKA_DAGBLADET&productCode=SVNY&productCode=SVMD&startDate=2000-01-05&endDate=2000-01-05&offset=0&sort=DATE&size=50"

The endpoint accepts the following parameters:

  • q - (text)

The query as entered by the user. It can contain multiple words or a phrase in quotes. If it is empty, all articles matching the other criteria are returned.

  • startDate - (ISO.DATE, for example "2011-12-03")

Start of the timespan that the search should cover (inclusive). If startDate is submitted, endDate must also be submitted.

  • endDate - (ISO.DATE, for example "2011-12-03")

End of the timespan that the search should cover (inclusive). If endDate is submitted, startDate must also be submitted.

  • newspaperBrand (text)

The newspaper brand. If it is not specified, the search is performed across all newspaper brands. The parameter may be specified multiple times to search in multiple brands. Each value must be one of the following:

  • AFTENPOSTEN
  • AFTONBLADET
  • BERGENS_TIDENDE
  • SVENSKA_DAGBLADET
  • STAVANGER_AFTENBLAD
  • VG
  • SCHN - Schibsted non-branded Norwegian
  • EXTN - External Norwegian

For one newspaper brand and day we may have multiple products available.

Limitations in brand search

At launch, only Norwegian brands are available. Searching for Swedish brands, whether by brand or by a product code that belongs to a Swedish brand, results in a 403. A search across everything, sent with neither a product code nor a brand, also results in a 403.

  • productCode (text)

One or more product codes to which the search should be limited. The list of all available product codes is on the page List of available products. If this parameter is not specified, the search is performed in all products. The parameter may be specified multiple times to search in multiple products. This parameter also supports "extended product codes", meaning product codes with a number at the end, like "SAAB1", "SAAB2" etc. You can use either a short code such as "SAAB" to find all Stavanger Aftenblad advertisements, or a specific extended code such as "SAAB1" to find only advertisements with that code. All products that support extended codes are listed in List of available products.

  • productIssueId (string)

All articles that have given productIssueId will be returned. It may be used in a use case when you have one article from previous search result and you want to find all other articles of the product that were published on the same day as original. productIssueId is globally unique so you don't have to specify day, productCode or brand when you pass productIssueId. You may (but don't have to) pass pageNumber or edition or any other filter together with productIssueId if you need to limit the results.

  • imageUrlValidity (int)

This parameter specifies how long the returned image URLs should be valid, in minutes. The default is 30. The maximum allowed value is 7 days (10080 minutes).

  • pageNumber (int)

Limit search only to pages with given page number.

  • pageId (string)

Limit search only to articles with a given page id.

  • articleId (string)

Limit search to one or more articles with the id of the article.

  • externalArticleId (string)

Limit search only to articles with a given externalArticleId.

  • excludedProductCode (text)

Filter out one or more product codes from the results.

  • updatedLaterThan (ISO 8601 date and time)

Limit the search to articles that have been added or modified in the archive more recently than the given time. The format is ISO 8601 date and time, for example "2024-10-27T03:33:20Z".

  • showDeleted (Boolean)

This field can be used when a caller wants to be informed that a previously available article has been deleted. By default the API doesn't return information about deleted articles (the value is false by default). If you pass “true” as the value of “showDeleted”, you will get short information about all deleted articles that satisfy your search parameters, along with the non-deleted ones.

  • sort, sortOrder (text)

sort must be one of: DATE, RELEVANCE. RELEVANCE means that the most relevant results (as scored by the search engine) come first. sortOrder must be one of: ASC, DESC.

  • offset, size (int)

Parameters used for paging. size determines how many result items are returned, and offset how many result items are skipped (starting from the top of the list). The maximum allowed size is 1000 and the default is 10. In addition, offset + size cannot be greater than 10000.
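
The parameter rules above can be checked on the client before a request is sent. The following Python sketch builds the query string for the search endpoint and rejects combinations the documentation describes as invalid. The function name build_search_url and its keyword arguments are chosen for illustration only; authentication is not described in this section and is left out.

```python
from urllib.parse import urlencode

# Endpoint URL taken from the example request in this document.
SEARCH_URL = "https://api.schibsted.com/paper-archive/api/v2/article/search"

MAX_SIZE = 1000              # documented maximum for "size"
MAX_WINDOW = 10000           # documented limit for offset + size
MAX_VALIDITY = 7 * 24 * 60   # 7 days expressed in minutes


def build_search_url(q=None, start_date=None, end_date=None, brands=(),
                     product_codes=(), offset=0, size=10,
                     sort=None, sort_order=None, image_url_validity=None):
    """Return a search URL; raise ValueError when a documented rule is broken."""
    if (start_date is None) != (end_date is None):
        raise ValueError("startDate and endDate must be submitted together")
    if size > MAX_SIZE:
        raise ValueError("size cannot be greater than 1000")
    if offset + size > MAX_WINDOW:
        raise ValueError("offset + size cannot be greater than 10000")
    if image_url_validity is not None and image_url_validity > MAX_VALIDITY:
        raise ValueError("imageUrlValidity cannot exceed 7 days")

    params = []
    if q:
        params.append(("q", q))
    if start_date is not None:
        params += [("startDate", start_date), ("endDate", end_date)]
    # Repeatable parameters are emitted once per value.
    params += [("newspaperBrand", brand) for brand in brands]
    params += [("productCode", code) for code in product_codes]
    if image_url_validity is not None:
        params.append(("imageUrlValidity", image_url_validity))
    if sort is not None:
        params.append(("sort", sort))
    if sort_order is not None:
        params.append(("sortOrder", sort_order))
    params += [("offset", offset), ("size", size)]
    return SEARCH_URL + "?" + urlencode(params)
```

urlencode also percent-encodes the query text, so a phrase in quotes can be passed as-is in q.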

Article search response

Example response:

{
   "hits": [
       {
           "articleId": "SVBI_20210201_3_A002",
           "date": "2021-02-01",
           "articleImageUrls": [
               "https://www.example.com/image1.jpg",
               "https://www.example.com/image2.jpg"
           ],
           "pageIds": [
               "ABNY_20161115_0008_5_idx1"
           ],
           "pageNumbers": [
               8
           ],
           "newspaperBrand": "SVENSKA_DAGBLADET",
           "productCode": "SVBI",
           "section": "Debatt",
           "textMarkup": "<h1 id=\"d5e6632\">Inkorgen Behövs verkligen alla de 349 riksdagsmännen?</h1>\n      <p class=\"lead\" id=\"d5e6635\">Skriv till oss på <em>inkorgen@aftonbladet.se.</em> Du kan använda signatur, men ange namn i kontakt med redaktionen. Skriv kort – vi förbehåller oss rätten att redigera insänt material.</p>\n      <p id=\"d5e6641\">Brukar då och då slötitta på tv på utsändningen från Riksdagen på förmiddagarna.</p>\n      <p id=\"d5e6644\">Jag fascineras av att det är så många riksdagsmän och ministrar som står, inför mestadels tomma stolar, och läser innantill vad ”någon annan” uppenbarligen har skrivit.</p>\n      <p id=\"d5e6647\">Så, behövs det verkligen så många som 349 riksdagsmän?</p>\n      <p class=\"author\" id=\"d5e6651\">\n        <strong>Thomas Montgomery</strong>\n      </p>\n      <figure class=\"image\">\n        <figcaption>\n          <p id=\"d5e6661\">\n            <strong>\n              <em>Många tomma stolar.</em>\n            </strong>\n          </p>\n          <p id=\"d5e6666\">Foto: TT</p>\n        </figcaption>\n        <img alt=\"figur\" id=\"d5e6655\" src=\"images/aftonbladet_20241209-024.jpg\"/>\n      </figure>",
           "text": "Inkorgen Behövs verkligen alla de 349 riksdagsmännen?\n\nSkriv till oss på inkorgen@aftonbladet.se. Du kan använda signatur, men ange namn i kontakt med redaktionen. Skriv kort – vi förbehåller oss rätten att redigera insänt material.\n\nBrukar då och då slötitta på tv på utsändningen från Riksdagen på förmiddagarna.\n\nJag fascineras av att det är så många riksdagsmän och ministrar som står, inför mestadels tomma stolar, och läser innantill vad ”någon annan” uppenbarligen har skrivit.\n\nSå, behövs det verkligen så många som 349 riksdagsmän?\n\nThomas Montgomery\n\nMånga tomma stolar.\n\nFoto: TT",
           "headline": "Inkorgen Behövs verkligen alla de 349 riksdagsmännen?",
           "revision": 1,
           "externalArticleId": "754573",
           "metadataFields": {
               "author": "Thomas Montgomery"
           },
           "lastUpdated": "2024-10-27T03:33:20Z",
           "isDeleted": false,
           "epaperUrl": "https://www.paper-archive.core.schibsted.io/paywall/article/SVBI_20210201_3_A002_rev2"
       },
       {
           "id": "ABSO_20210401_7_A004",
           "date": "2021-04-01",
           "newspaperBrand": "AFTONBLADET",
           "productCode": "ABSO",
           "revision": 1,
           "externalArticleId": "02040",
           "lastUpdated": "2021-04-01T13:20:00Z",
           "isDeleted": true
       }
   ],
   "total": 2
}

Description of fields in the response:

  • total - total number of matching items (may be capped at some value)
  • hits - list of matching items, that is, the articles matching the search request
  • articleId - unique identifier of result item i.e. article
  • date - date on which given article was published
  • articleImageUrls - URLs of the images presented in the article. These URLs are valid only for a limited time, which depends on the request parameter imageUrlValidity (maximum 7 days). The list corresponds to the images in <img> tags in textMarkup (“articleImageUrls” contains exactly as many images as there are <img> tags).
  • pageIds - ids of the pages the article is printed on; there can be several if the article spans multiple pages. Each value corresponds to the “id” parameter of the page endpoint.
  • pageNumbers - numbers of the pages the article is printed on; there can be several if the article spans multiple pages.
  • newspaperBrand - brand of the newspaper, for example SVENSKA_DAGBLADET (for the list of available values, see the search request).
  • productCode - code of a product like for example SVPG ("SvD Perfect Guide"). It often doesn't correspond with what we think of as a "newspaper" that a customer buys in store, because what is bought often contains multiple products bundled together like for example SvD News product + SvD Business product.
  • textMarkup - The content of the article with markup in HTML format. It can contain following list of HTML tags:
    • <a>, <aside>, <br>, <em>, <figcaption>, <figure>, <h1>, <h2>, <h3>, <h4>, <hr>, <img>, <li>, <ol>, <p>, <section>, <strong>, <sup>, <ul>, <blockquote>
  • text - The text of the article in a pure text format.
  • headline - The headline of the article.
  • metadataFields - A mapping of metadata values about the article, such as author, saved from the source system. These values are not present on every article and are not always consistent.
  • section - Some brands and products use “section” to describe the type or topic of an article, like “Ledarsida”, “Debatt”, “FORSTESIDA” or “Sports”. The value is taken straight from the raw data, so language and usage differ between products.
  • externalArticleId - The article Id from source publishing system. There are multiple source publishing systems and there's no guarantee that this ID is unique.
  • lastUpdated - The timestamp of when the article was indexed in the archive. With every change in the article content "lastUpdated" will change too.
  • revision - The version number of the article. When an article is revised, its revision value is incremented. Only the newest revision of the article is returned, and “revision” can be used to detect a change in article content.
  • epaperUrl - The link to the website where one can access and read the article, behind a paywall. This link is a long-term valid link and you can expect that when fetching the page you will be redirected.
  • productIssueId - Identifier of the product issue to which a given page belongs. All pages of a given product on a given day have the same productIssueId (for example, all pages of the first advertisement appendix for SvD on a given day share one productIssueId, while the second advertisement appendix on the same day has a different one). It is a globally unique identifier. It can be used when you have one page from the search results and want to find all other pages of that product on the same day.
  • highlightText - Fragment of page text matching the search query (it contains searched word in a <em> tag and a few words around). “highlightText” is only present if a “q” parameter was passed in the search criteria.
  • isDeleted - This field is only relevant if you pass the “showDeleted” criterion. In that case, articles that have been deleted are returned in a much smaller format containing only id, date, brand, productCode and lastUpdated.
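
When “showDeleted” is enabled, a response mixes full hits with the short records of deleted articles, so a client usually has to separate the two. A minimal Python sketch, assuming the response has already been parsed into a dict; the helper names are illustrative, and accepting both “articleId” and “id” is a defensive assumption based on the example response, where the deleted record carries “id”.

```python
def hit_id(hit):
    """Return the identifier of a hit, whichever key it is stored under."""
    # The example response uses "articleId" on the full hit and "id" on
    # the deleted one, so both keys are checked here.
    return hit.get("articleId") or hit.get("id")


def split_hits(response):
    """Separate full article hits from the short records of deleted articles."""
    live, deleted = [], []
    for hit in response.get("hits", []):
        if hit.get("isDeleted"):
            deleted.append(hit)
        else:
            live.append(hit)
    return live, deleted
```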

Typical use cases

  1. Showing an “article read mode” for an article on a given page

In that case you can either fetch a given article by “articleId” (known from the page data returned by the page endpoint) or, if you want a list of articles, search by “productIssueId” and “pageNumber”. Then you can use the textMarkup field to render the article using your own stylesheet.
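
Before rendering textMarkup, the relative image paths in its <img> tags usually need to be replaced with the time-limited URLs from articleImageUrls. The Python sketch below does this with a regular expression. It assumes that the URLs are listed in the same order as the <img> tags appear, which the field description implies (same number of entries) but does not state outright; resolve_images is a hypothetical helper name.

```python
import re
from html import escape

# Captures everything up to the opening quote of src, and the closing quote.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")[^"]*(")')


def resolve_images(text_markup, image_urls):
    """Replace each <img> src in order with the next URL from image_urls."""
    urls = iter(image_urls)

    def swap(match):
        url = next(urls, None)
        if url is None:
            # Fewer URLs than tags: leave the remaining tags untouched.
            return match.group(0)
        # Escape the URL so that "&" in a query string stays valid HTML.
        return match.group(1) + escape(url, quote=True) + match.group(2)

    return IMG_SRC.sub(swap, text_markup)
```

For markup from untrusted sources, an HTML parser would be a sturdier choice than a regular expression; the pattern here relies on src being written with double quotes, as in the example response.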

  2. Fetching a stream of newly published articles

If you want access to all published articles, you can call the articles endpoint at regular intervals (for example every 30 seconds) with an “updatedLaterThan” criterion. It is recommended to set “updatedLaterThan” at least 30 minutes earlier than the time of your last call, to compensate for any delay in making new data available. To find out which of the returned articles have changed since your last call, you can use either the “revision” or the “lastUpdated” field.

It’s also recommended to enable the “showDeleted” flag in search criteria and then delete any information about articles that have “isDeleted” set in the results.
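
The bookkeeping for such a polling loop can be sketched in Python as follows. The sketch covers only the two pieces described above: computing “updatedLaterThan” with a 30-minute overlap, and merging a batch of hits into a local store while honouring “isDeleted”. Fetching the data is left out, and the function names and the dict-based store are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

OVERLAP = timedelta(minutes=30)  # recommended safety margin


def updated_later_than(last_call):
    """Format the cutoff for the next poll: 30 minutes before the last call.

    last_call must be a timezone-aware datetime.
    """
    cutoff = (last_call - OVERLAP).astimezone(timezone.utc)
    return cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")


def apply_hits(store, hits):
    """Merge hits into store ({id: hit}); return ids added, changed or removed."""
    changed = []
    for hit in hits:
        key = hit.get("articleId") or hit.get("id")
        if hit.get("isDeleted"):
            # Drop local data for deleted articles.
            if store.pop(key, None) is not None:
                changed.append(key)
            continue
        known = store.get(key)
        if (known is None
                or known.get("revision") != hit.get("revision")
                or known.get("lastUpdated") != hit.get("lastUpdated")):
            store[key] = hit
            changed.append(key)
    return changed
```

Because consecutive polls overlap by 30 minutes, the same article is returned many times; comparing “revision” and “lastUpdated” against the stored copy is what keeps repeated hits from being treated as changes.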

  3. Searching for given text in articles

In that case you usually pass the “q” parameter with the query. You can optionally limit the search by providing startDate and endDate, and/or by restricting it to one newspaperBrand or a specific productCode.
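
A text search can match more articles than fit in one response, so results are read page by page with offset and size. The Python sketch below yields the (offset, size) pairs needed to walk a result set while staying inside the documented limits (size at most 1000, offset + size at most 10000). page_windows is a hypothetical helper, and total stands for the “total” value of the first response.

```python
def page_windows(total, size=1000, max_window=10000):
    """Yield (offset, size) pairs covering the reachable part of a result set."""
    # Results beyond offset + size = 10000 cannot be requested.
    reachable = min(total, max_window)
    offset = 0
    while offset < reachable:
        step = min(size, reachable - offset)
        yield offset, step
        offset += step
```

If total exceeds 10000, one possible way to reach the remaining articles is to narrow the search, for example by splitting the startDate and endDate range into smaller spans and paging through each span separately.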

